Inter-rater and test–retest reliability of quality assessments by novice student raters using the Jadad and Newcastle–Ottawa Scales
نویسندگان
چکیده
INTRODUCTION Quality assessment of included studies is an important component of systematic reviews. OBJECTIVE The authors investigated inter-rater and test-retest reliability for quality assessments conducted by inexperienced student raters. DESIGN Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle-Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs and they each independently rated the quality of 13-20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. SETTING McMaster Integrative Neuroscience Discovery and Study Program. PARTICIPANTS 10 students taking McMaster Integrative Neuroscience Discovery and Study Program courses. MAIN OUTCOME MEASURES The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type 2,1 or ICC(2,1). The authors measured test-retest reliability using ICC(2,1). RESULTS Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI -0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were -0.14 (95% CI -0.28 to 0.00) to 0.39 (95% CI -0.02 to 0.81) for the NOS cohort and -0.20 (95% CI -0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case-control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test-retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. Corresponding ICC(2,1)s for the NOS cohort were -0.19 (95% CI -0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case-control, the ICC(2,1)s were 0.46 (95% CI -0.13 to 0.92) and 0.83 (95% CI 0.48 to 0.95). CONCLUSIONS Inter-rater reliability was generally poor to fair and test-retest reliability was fair to excellent. A pilot rating phase following rater training may be one way to improve agreement.
منابع مشابه
Assessing the Validity and Reliability of the Persian Version of the Interpersonal Problem Solving Skills Assessment Tool in Schizophrenia
Objective: This study aimed to translate the Assessment of Interpersonal Problem-Solving Skills (AIPSS) into Persian and to evaluate the validity and reliability of the Persian version of AIPSS to use for adults with schizophrenia. Materials & Methods: In this methodological study, the translation process was performed according to the International Quality of Life Assessment (IQOLA) protocol....
متن کاملTest-Retest and Inter-Rater Reliability Study of the Schedule for Oral-Motor Assessment in Persian Children
Objectives: Reliable and valid clinical tools to screen, diagnose, and describe eating functions and dysphagia in children are highly warranted. Today most specialists are aware of the role of assessment scales in the treatment of affected individuals. However, the problem is that the clinical tools used might be nonstandard, and worldwide, there is no integrated assessment performed to assess ...
متن کاملFunctional Movement Screen in Elite Boy Basketball Players: A Reliability Study
Purpose: To investigate the reliability of Functional Movement Screen (FMS) in basketball players. A few studies have compared the reliability of FMS between raters with different experience in athletes. The purpose of this study was to compare the FMS scoring between the beginners and expert raters using video records. Methods: This is a cross-sectional study. The study subjects compris...
متن کاملStandard setting: Comparison of two methods
BACKGROUND The outcome of assessments is determined by the standard-setting method used. There is a wide range of standard-setting methods and the two used most extensively in undergraduate medical education in the UK are the norm-reference and the criterion-reference methods. The aims of the study were to compare these two standard-setting methods for a multiple-choice question examination and...
متن کاملThe Use of Global Rating Scales for OSCEs in Veterinary Medicine
OSCEs (Objective Structured Clinical Examinations) are widely used in health professions to assess clinical skills competence. Raters use standardized binary checklists (CL) or multi-dimensional global rating scales (GRS) to score candidates performing specific tasks. This study assessed the reliability of CL and GRS scores in the assessment of veterinary students, and is the first study to dem...
متن کامل